Some slides in this deck are adapted from the Lecture 2 slides of Dr. Çetinkaya-Rundel's course.
In May 2015, Science retracted a study of how canvassers can sway people’s opinions about gay marriage, published just 5 months earlier.
Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false.
Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.
Methods we’ll discuss today can’t prevent this, but they can make it easier to discover issues.
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates:
“The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.”
The authors informed the journal that the merge of lab results and other survey data used in the paper resulted in an error regarding the identification codes. Results of the analyses were based on the data set in which this error occurred. Further analyses established the results reported in this manuscript and interpretation of the data are not correct.
Original conclusion: Lower levels of CSF IL-6 were associated with current depression and with future depression […].
Revised conclusion: Higher levels of CSF IL-6 and IL-8 were associated with current depression […].
#1 Convince researchers to adopt a reproducible research workflow
#2 Train new researchers who don’t have any other workflow
Do the data analysis with code: R
Make the code more readable: R Markdown
Version control: Git / GitHub
Donald Knuth “Literate Programming (1983)”
“Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”
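Choose the Windows or Mac version.
After installation, sign in to the desktop client.
In the client, create a repository, give it a name, and choose a local path for it.
Right-click the repository and open it in the file browser.
You can then create files in that folder, or drag in files you have already prepared.
A commit records the changes you have made.
When committing, you can add a short message describing what you changed.
Committed changes can be rolled back at any time.
After a commit, the changes are saved only in the local repository.
Use Publish to send the repository to the remote server and share it with others.
After each change you can commit, then sync to push the changes to the remote server.
You can also go back and look at previously saved versions.
For a major change of direction, you can create a new repository (folder) and start over.
Open the corresponding repository on your GitHub page to see what you have pushed or published.
Locate the local path where the repository above is stored.
If the files are R-related, you can open them in RStudio to edit them.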
Select: Stage
Confirm: Commit (optionally with a short message)
Upload: Push (requires your GitHub account credentials)
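On the GitHub website, edit the README file and commit it with a short description of what you did.
Then edit the README file in RStudio as well, making a different change, and commit it.
Merge conflicts come up often in group work, so knowing how to resolve them is very important.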
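The auto-generated template file already contains a YAML header, some simple text and code, and notes on how to use it. Insert a code chunk with Cmd+Option+I.
Indicates a quotation
When you have finished editing, click the Knit button in RStudio to render the document.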
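The Knit step can also be run from the R console, which helps when rendering needs to be scripted. The following is a minimal sketch, assuming the rmarkdown package is installed and pandoc is available (RStudio ships with its own copy); the file content and the variable names rmd_path and out_path are made up for illustration.

```r
# Build a tiny R Markdown file in a temporary location (illustrative content).
fence <- strrep("`", 3)  # three backticks, built here to keep this example readable
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Minimal example\"",
  "output: html_document",
  "---",
  "",
  "Some text, followed by a code chunk:",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_path)

# Render only if pandoc can be found; render() returns the output file path.
if (rmarkdown::pandoc_available()) {
  out_path <- rmarkdown::render(rmd_path, quiet = TRUE)
  print(out_path)
}
```

Rendering from the console runs the document in a fresh knitting session, so the document has to load its own packages and data rather than rely on what happens to be in the interactive workspace.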
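Show the relationship between life expectancy and GDP per capita across the countries of the world. Hans Rosling once gave a TED talk on this topic.
We will use the dplyr package (for data wrangling) and ggplot2 (for visualization).
First make sure these packages are installed.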
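One way to check for the packages is sketched below. It is not part of the original slides; the variable names pkgs and missing_pkgs are illustrative, and install.packages() is only called for packages that cannot be found.

```r
# Packages used in this lecture.
pkgs <- c("dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package can be loaded,
# without attaching it to the search path.
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing (needs an internet connection and a CRAN mirror).
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
```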
Load the packages in the R Markdown document:
library(dplyr)
library(ggplot2)
gapminder <- read.csv("https://stat.duke.edu/~mc301/data/gapminder.csv")
Start with the gapminder dataset
Select the cases where year equals 2007
Save the selected cases in a new dataset, gap07
gap07 <- gapminder %>%
  filter(year == 2007)
Task: Show the relationship between gdpPercap and lifeExp.
qplot(x = gdpPercap, y = lifeExp, data = gap07)
Task: Use a different color for the points of each continent.
qplot(x = gdpPercap, y = lifeExp, color = continent, data = gap07)
Task: Make the size of each point proportional to population.
qplot(x = gdpPercap, y = lifeExp, color = continent, size = pop, data = gap07)
What if we analyze the data for 1952 instead?
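Select the 1952 data
Draw a scatterplot of life expectancy (lifeExp) against population (pop)
Make the size of each point proportional to GDP per capita (gdpPercap)
Hint: add size = gdpPercap at the appropriate place in the code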
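One possible way to approach the exercise is sketched below. To keep it runnable without a network connection it uses a small placeholder data frame, gapminder_toy, whose values are invented for illustration and are not real Gapminder figures; with the real data you would filter gapminder instead. It also uses ggplot() with geom_point() rather than qplot(), since recent ggplot2 releases mark qplot() as deprecated.

```r
library(dplyr)
library(ggplot2)

# Placeholder rows with the same column names as the gapminder data.
# The numbers are made up for illustration only.
gapminder_toy <- data.frame(
  country   = c("A", "B", "C", "A", "B", "C"),
  continent = c("Asia", "Europe", "Africa", "Asia", "Europe", "Africa"),
  year      = c(1952, 1952, 1952, 2007, 2007, 2007),
  lifeExp   = c(40, 65, 38, 70, 80, 55),
  pop       = c(5e6, 2e7, 8e6, 9e6, 2.2e7, 2e7),
  gdpPercap = c(800, 6000, 500, 4000, 30000, 1500)
)

# Step 1: keep only the 1952 rows.
gap52 <- gapminder_toy %>%
  filter(year == 1952)

# Step 2: life expectancy against population, point size mapped to gdpPercap.
p52 <- ggplot(gap52, aes(x = pop, y = lifeExp, size = gdpPercap)) +
  geom_point()
print(p52)
```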